1 Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

1.1 Load the data

This data is part of the openintro textbook and we can load and inspect it. There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss

data(yrbss)
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
yrbss %>% mutate(race = factor(race))
## # A tibble: 13,583 × 13
##      age gender grade hispanic race    height weight helmet_12m text_while_driv…
##    <int> <chr>  <chr> <chr>    <fct>    <dbl>  <dbl> <chr>      <chr>           
##  1    14 female 9     not      Black …  NA      NA   never      0               
##  2    14 female 9     not      Black …  NA      NA   never      <NA>            
##  3    15 female 9     hispanic Native…   1.73   84.4 never      30              
##  4    15 female 9     not      Black …   1.6    55.8 never      0               
##  5    15 female 9     not      Black …   1.5    46.7 did not r… did not drive   
##  6    15 female 9     not      Black …   1.57   67.1 did not r… did not drive   
##  7    15 female 9     not      Black …   1.65  132.  did not r… <NA>            
##  8    14 male   9     not      Black …   1.88   71.2 never      <NA>            
##  9    15 male   9     not      Black …   1.75   63.5 never      <NA>            
## 10    15 male   10    not      Black …   1.37   97.1 did not r… <NA>            
## # … with 13,573 more rows, and 4 more variables: physically_active_7d <int>,
## #   hours_tv_per_school_day <chr>, strength_training_7d <int>,
## #   school_night_hours_sleep <chr>
skimr::skim(yrbss)
Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅

Before you carry on with your analysis, it’s is always a good idea to check with skimr::skim() to get a feel for missing values, summary statistics of numerical variables, and a very rough histogram.

1.2 Exploratory Data Analysis

You will first start with analyzing the weight of participants in kilograms. Using visualization and summary statistics, describe the distribution of weights. How many observations are we missing weights from?

#number of missing values in the 'weight' column:
sumna <- sum(is.na(yrbss['weight']))
print(paste0("there are " , sumna ," missing values." ))
## [1] "there are 1004 missing values."
#using mosaic's favstats
favstats(~weight, data=yrbss)
##   min   Q1 median   Q3 max mean   sd     n missing
##  29.9 56.2   64.4 76.2 181 67.9 16.9 12579    1004
#as it is also apparent in the favstat's function output, there are 1004 missing values in the weight column. 

#histogram of the weight distribution
  ggplot(yrbss, aes(x=weight))+
  geom_histogram(bindwidth=5, color ="black", fill = "white")+
  labs (title = "Histogram of Weigth Distribution",
        y = "Count",
        x = "Weight (in kilograms)")+ 
  theme_bw()+
  geom_vline(aes(xintercept = mean(weight, na.rm = TRUE)), color="blue", linetype="dashed", size=1)+
  NULL

#density plot of the weight distribution
  ggplot(yrbss, aes(x=weight)) +
  geom_density() +
  labs (title = "Density Plot of Weight Distribution",
        x = "Weight (in kilograms)")+
  theme_bw()+
  NULL

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Let’s create a new variable in the dataframe yrbss, called physical_3plus , which will be yes if they are physically active for at least 3 days a week, and no otherwise. You may also want to calculate the number and % of those who are and are not active for more than 3 days. Use the count() function and see if you get the same results as group_by()... summarise()

yrbss %>%
  filter(!is.na(physically_active_7d))%>%
  ggplot(aes(x = factor(physically_active_7d))) +
  geom_boxplot(aes(y=weight))+
  labs (title = "Distribution of Weight by Number of Active days in a week",
        x = "Number of Active Days in a Week",
        y = "weight (kg)") +
  theme_bw()

  #adding the new column that shows if a student is physically active 3 or more days a week. 
  physical <- yrbss %>% 
    mutate(physical_3plus = case_when(physically_active_7d >= 3 ~ "yes",
                                       physically_active_7d < 3 ~ "no"),
          physical_3plus = factor(physical_3plus, levels = c("yes", "no")))  
 
 count_physical <- physical %>% 
   filter(!is.na(physical_3plus)) %>% 
   count(physical_3plus) %>% 
   mutate(prop = n/sum(n))
 
 #5685, 2656 
 
 prop.test(count_physical$n[2]  , count_physical$n[2] + count_physical$n[1])
## 
##  1-sample proportions test with continuity correction
## 
## data:  [ out of +count_physical$n out of count_physical$n[2]2 out of count_physical$n[1]
## X-squared = 1522, df = 1, p-value <2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.323 0.339
## sample estimates:
##     p 
## 0.331

Can you provide a 95% confidence interval for the population proportion of high schools that are NOT active 3 or more days per week?

TEAM ANSWER:

The 95 percent confidence interval according to the prop.test function is from 0.3228978 to 0.3389583.

data: [ out of +count_physical\(n out of count_physical\)n[2]2 out of count_physical$n[1] X-squared = 1522, df = 1, p-value <2e-16 alternative hypothesis: true p is not equal to 0.5 95 percent confidence interval: 0.323 0.339 sample estimates: p 0.331

Initial plotting of the data doesn’t suggest a strong correlation between weekly physical activity levels and weight.

Make a boxplot of physical_3plus vs. weight. Is there a relationship between these two variables? What did you expect and why?

#removing rows that contain NAs from the physical table... 
   physical <- physical %>% 
   filter(!is.na(physical_3plus) & !is.na(weight))
   
#box plot of physical_3plus vs weight:
  
  ggplot(physical, aes(x=physical_3plus, y = weight)) +
  geom_boxplot()+
  labs (title = "Distribution of Weight, Based on weekly activity level",
        x = "Working out 3 or more days: Yes or No",
        y = "weight (kg)") +
  theme_bw()

TEAM ANSWER:

People who work out 3 days a week or more are slightly heavier on average. One might expect the opposite, but weight is not a great indicator of fitness levels as it varries widely with height, and the number of workout days might not indicate the intensity of the workouts either. We would expect a stronger negative correlation between bodyfat percentage and number of active days a week.

1.3 Confidence Interval

Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test. Note that when we calculate the mean, SD, etc. weight in these groups using the mean function, we must ignore any missing values by setting the na.rm = TRUE.

set.seed(1234)
#calculating CI using formulas:

#physical_3plus == "yes
 yes_result<- physical %>% 
   filter(physical_3plus == "yes") %>% 
   summarise( yes_mean = mean(weight, na.rm = TRUE),
              yes_n = count(physical_3plus == "yes"),
              yes_sd = sd(weight, na.rm = TRUE),
              t_critical = qt(0.975, yes_n -1),
              se_yes = yes_sd / sqrt(yes_n),
              margin_of_error = t_critical * se_yes,
              yes_low_CI= yes_mean - margin_of_error,
              yes_high_CI= yes_mean + margin_of_error
              )

print(paste0("the confidence interval is ",  as.numeric(yes_result['yes_low_CI']), " - ", as.numeric(yes_result['yes_high_CI']) ))
## [1] "the confidence interval is 68.094807893518 - 68.8021328880692"
#we can get a very similar result using the bootstrap approach for confidence intervals by infer package:

boot_dist1 <- physical %>% 
   filter(physical_3plus == "yes") %>% 

  specify(response = weight) %>%
  # Generate bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  # Calculate mean of each bootstrap sample
  calculate(stat = "mean")

boot_dist1
## Response: weight (numeric)
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  68.2
##  2         2  68.3
##  3         3  68.4
##  4         4  68.4
##  5         5  68.8
##  6         6  68.0
##  7         7  68.5
##  8         8  68.4
##  9         9  68.6
## 10        10  68.6
## # … with 990 more rows
percentile_ci1 <- get_ci(boot_dist1, level = 0.95, type ="percentile") 

#visualisation of the resulting bootstrap distribution and the CIs
  visualize(boot_dist1) +
  shade_ci(endpoints=percentile_ci1, fill="khaki") +
  geom_vline(xintercept = yes_result$yes_low_CI, colour ="red") +
  geom_vline(xintercept = yes_result$yes_high_CI, colour = "red") +
  NULL

#physical_3plus == "no"
no_result <-physical %>% 
   filter(!is.na(physical_3plus)) %>%
   filter(physical_3plus == "no") %>% 
   summarise( no_mean = mean(weight, na.rm = TRUE),
              no_n = count(physical_3plus == "no"),
              no_sd = sd(weight, na.rm = TRUE),
              t_critical = qt(0.975, no_n -1),
              se_no= no_sd / sqrt(no_n),
              margin_of_error = t_critical * se_no,
              no_low_CI= no_mean - margin_of_error,
              no_high_CI= no_mean + margin_of_error)

print(paste0("the confidence interval is ",  as.numeric(no_result['no_low_CI']), " - ", as.numeric(no_result['no_high_CI']) ))
## [1] "the confidence interval is 66.1286201312585 - 67.2191521213521"
#we can also get a very similar result using the bootstrap approach for confidence intervals by infer package:

boot_dist2 <- physical %>% 
   filter(!is.na(physical_3plus)) %>% 
   filter(physical_3plus == "no") %>% 
   specify(response = weight) %>%
  # Generate bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  # Calculate mean of each bootstrap sample
  calculate(stat = "mean")

percentile_ci <- get_ci(boot_dist2, level = 0.95, type ="percentile") 

#visualisation of the resulting bootstrap distribution and the CIs
  visualize(boot_dist2) +
  shade_ci(endpoints=percentile_ci, fill="khaki") +
  geom_vline(xintercept = no_result$no_low_CI, colour ="red") +
  geom_vline(xintercept = no_result$no_high_CI, colour = "red") +
  labs(title="Simulation-Based Bootsrap Distribution for No Answers") +
  NULL

There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is at least 95% statistically significant. Let us also conduct a hypothesis test.

1.4 Hypothesis test with formula

Write the null and alternative hypotheses for testing whether mean weights are different for those who exercise at least 3 times a week and those who don’t.

H0 = there is no difference in the weight of those who exercise 3 times a week and more compaed to those who do not. h1 = there is a difference in the weight of those who exercise 3 times a week and more and those who do not.

t.test(weight ~ physical_3plus, data = physical)
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = 5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group yes and group no is not equal to 0
## 95 percent confidence interval:
##  1.12 2.42
## sample estimates:
## mean in group yes  mean in group no 
##              68.4              66.7

Since the T value is 5 we can reject the NULL hypothesis, which means that we statistically sufficient evidence to say that there is a difference in the mean weight of those who work out more than 3 days a week and those who don’t.

1.5 Hypothesis test with infer

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

But first, we need to initialize the test, which we will save as obs_diff.

obs_diff <- physical %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.77

Notice how you can use the functions specify and calculate again like you did for calculating confidence intervals. Here, though, the statistic you are searching for is the difference in means, with the order being yes - no != 0.

After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.

  null_dist <- physical %>%
  # specify variables
  specify(weight ~ physical_3plus) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("yes", "no"))

Here, hypothesize is used to set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one sample cases, the null argument can be set to point to test a hypothesis relative to a point estimate.

Also, note that the type argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test.

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

Now that the test is initialized and the null distribution formed, we can visualise to see how many of these null permutations have a difference of at least obs_stat of 1.77.

We can also calculate the p-value for your hypothesis test using the function infer::get_p_value().

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

This the standard workflow for performing hypothesis tests.

2 IMDB ratings: Differences between directors

Recall the IMBD ratings data. I would like you to explore whether the mean IMDB rating for Steven Spielberg and Tim Burton are the same or not. I have already calculated the confidence intervals for the mean ratings of these two directors and as you can see they overlap.

First, I would like you to reproduce this graph. You may find geom_errorbar() and geom_rect() useful.

In addition, you will run a hypothesis test. You should use both the t.test command and the infer package to simulate from a null distribution, where you assume zero difference between the two.

Before anything, write down the null and alternative hypotheses, as well as the resulting test statistic and the associated t-stat or p-value. At the end of the day, what do you conclude?

H0 = there is no difference between the mean IMBD ratings of Steven Spielberg and Tim Burton H1 = there is a difference between the mean IMBD ratings of Steven Spielberg and Tim Burton

You can load the data and examine its structure

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies) 
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Your R code and analysis should go here. If you want to insert a blank chunk of R code you can just hit Ctrl/Cmd+Alt+I

compare2 <- movies %>% 
  filter(director %in% c("Tim Burton", "Steven Spielberg"))   %>% 
  group_by(director) %>% 
   summarise(mean = round(mean(rating), 2),
              n = count(director),
              sd = sd(rating),
              t_critical = qt(0.975, n - 1),
              se = sd / sqrt(n),
              margin_of_error = t_critical * se,
              low_CI= round(mean - margin_of_error, 2) ,
              high_CI= round(mean + margin_of_error, 2)
              )%>%
  mutate(director = factor(director, levels = c("Tim Burton","Steven Spielberg")))

graph <- ggplot(compare2, aes(colour=director)) +
  geom_errorbar(aes(xmin = low_CI, xmax = high_CI, y= director), width = 0.1, size = 1.5)  +
  scale_color_manual(values = c("skyblue","tomato"))+
  geom_point(aes(x=mean, y=director), size = 3 ) +
  labs(title="Do Spielberg and Burton have the same IMDB ratings?",
       subtitle="95% confidence intervals overlap",
       x="Mean IMDB Rating",
       y =" ") +
  geom_text(aes(label = low_CI, x=low_CI, y=director), size = 4, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08) +
  geom_text(aes(label = high_CI, x=high_CI, y=director),size = 4, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08) +
  geom_text(aes(label = mean, mean, y=director), size = 6, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08)+
geom_rect( mapping=aes(xmin= 7.27, xmax= 7.33, ymin=0, ymax=3), color="lightgrey", alpha=0.2) +
  theme_bw()
  
  
graph +theme(legend.position = "none") 

set.seed(1234)

#initial observation table:
 movies %>% 
  filter(director %in% c("Steven Spielberg", "Tim Burton")) %>% 
  group_by(director) %>% 
  summarise(n = n(),
            mean = mean(rating),
            sd = sd(rating))  
## # A tibble: 2 × 4
##   director             n  mean    sd
##   <chr>            <int> <dbl> <dbl>
## 1 Steven Spielberg    23  7.57 0.695
## 2 Tim Burton          16  6.93 0.749
compare <- movies %>% 
  filter(director %in% c("Steven Spielberg", "Tim Burton")) 

#CI for both
compare2 <- movies %>% 
  filter(director %in% c("Tim Burton", "Steven Spielberg"))   %>% 
  group_by(director) %>% 
   summarise(mean = mean(rating),
              n = count(director),
              sd = sd(rating),
              t_critical = qt(0.975, n - 1),
              se = sd / sqrt(n),
              margin_of_error = t_critical * se,
              low_CI= mean - margin_of_error,
              high_CI= mean + margin_of_error,
              )


#t-test 
t.test(rating ~ director, data = compare)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
#CI: 0.1596624 1.1256637, t=2.714



obs_diff2 <- compare %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

obs_diff2
## Response: rating (numeric)
## Explanatory: director (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.643
#hypothesis testing with infer:

  null_dist_movies <- compare %>%
  # specify variables
  specify(rating ~ director) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
  
#visualise the null distribution 
null_dist_movies %>% visualize() +
  shade_p_value(obs_stat = obs_diff2, direction = "two-sided")

null_dist_movies %>%
  get_p_value(obs_stat = obs_diff2, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.008

3 Omega Group plc- Pay Discrimination

At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated in the company, in the sense that the salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (of which 24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.

You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.

3.1 Loading the data

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
skim(omega)
Data summary
Name omega
Number of rows 50
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary 0 1 68717 8638.2 47033 63303.16 68847 74777.7 84576 ▁▃▇▆▅
experience 0 1 14 11.9 0 2.25 15 20.8 44 ▇▃▅▂▁

3.2 Relationship Salary - Gender ?

The data frame omega contains the salaries for the sample of 50 executives in the company. Can you conclude that there is a significant difference between the salaries of the male and female executives?

Note that you can perform different types of analyses, and check whether they all lead to the same conclusion

. Confidence intervals . Hypothesis testing . Correlation analysis . Regression

Calculate summary statistics on salary by gender. Also, create and print a dataframe where, for each gender, you show the mean, SD, sample size, the t-critical, the SE, the margin of error, and the low/high endpoints of a 95% condifence interval

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size, the t-critical value, the standard error, the margin of error, and the low/high endpoints of a 95% condifence interval

sumstat <- omega %>% 
  group_by(gender) %>% 
   summarise(mean = mean(salary),
              n = count(gender),
              sd = sd(salary),
              t_critical = qt(0.975, n - 1),
              se = sd / sqrt(n),
              margin_of_error = t_critical * se,
              low_CI= mean - margin_of_error,
              high_CI= mean + margin_of_error,
              )

What can you conclude from your analysis? A couple of sentences would be enough

The confidence intervals for the mean salaries by gender do not overlap meaning we have statistically sufficient evidence to conclude that there is a difference between in mean salary by gender. We would assume that in running a hypothesis test, our p-value will be less than 0.05 but we will run the hypothesis test to double-check.

You can also run a hypothesis testing, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money. You should tun your hypothesis testing using t.test() and with the simulation method from the infer package.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package
 set.seed(1234)
 
  null_dist_salary <- omega %>%
  # specify variables
  specify(salary ~ gender) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("female", "male"))
  

  
  obs_diff3 <- omega %>%
  specify(salary ~ gender) %>%
  calculate(stat = "diff in means", order = c("female", "male"))

obs_diff3
## Response: salary (numeric)
## Explanatory: gender (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -8696.
#visualise the null distribution 
null_dist_salary %>% visualize() +
  shade_p_value(obs_stat = obs_diff3, direction = "two-sided")

null_dist_salary %>%
  get_p_value(obs_stat = obs_diff3, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

What can you conclude from your analysis? A couple of sentences would be enough

Though the infer package returns a 0 p_value, it is impossible to have a p value of 0. Infer package sometimes returns p_values of 0 when the p-value is very small. When running the t-test, we got a p-value of 2e-04 which is very low and it confirms our hypothesis that inferpackage rounded down the p-value to 0. In conclusion the null hypothesis is rejected and that there is a difference between the salaries of men and women.

3.3 Relationship Experience - Gender?

At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire send out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years experience on average (see table below).

# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?

H0 = There is no difference between the mean experience of males and females. mean(f)-mean(m) = 0 HA = There is a difference between the mean experience of males and females. mean(f)-mean(m)!= 0

#manual CI calculation:
sumstatExp <- omega %>% 
  group_by(gender) %>% 
   summarise(mean = mean(experience),
              n = count(gender),
              sd = sd(experience),
              t_critical = qt(0.975, n - 1),
              se = sd / sqrt(n),
              margin_of_error = t_critical * se,
              low_CI= mean - margin_of_error,
              high_CI= mean + margin_of_error,
              )

#t-test:
t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12

Confidence intervals do not overlap and t-test gives a p value of 1e-05, so that we can reject the null hypothesis and there is a statistically sufficient evidence to conclude that there is a difference between the mean experience of men and women at work.

3.4 Relationship Salary - Experience ?

Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.

Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data

ggplot(omega, aes(x=experience, y=salary )) +
  geom_point()+
  geom_smooth()+
  labs(title="Scatterplot of Experience vs Salary")+
  theme_bw()

There’s a positive relationship between experience and salary up until 30 years, and then the trend line seems to flatten.

3.5 Check correlations between the data

You can use GGally:ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order our variables will appear in and have the dependent variable (Y), salary, as last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make ths plots somewhat transparent (alpha = 0.3).

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

Look at the salary vs experience scatterplot. What can you infer from this plot? Explain in a couple of sentences

Women have a lot less experience and less salary and females have a stronger correlation between their experience and salary than males. Whereas males have more experience but there is less correlation between salary and experience. It is interesting that women do not work in the company as long. We could look into why women have less experience at the company. Is it because they are having kids? or is it because they are not being promoted to management roles because they are women?

4 Challenge 1: Brexit plot

Using your data manipulation and visualisation skills, please use the Brexit results dataframe (the same dataset you used in the pre-programme assignement) and produce the following plot. Use the correct colour for each party; google “UK Political Party Web Colours” and find the appropriate hex code for colours, not the default colours that R gives you.

brexit <- read_csv(here::here("data", "brexit_results.csv"))
glimpse(brexit)
## Rows: 632
## Columns: 11
## $ Seat        <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015    <dbl> 50.6, 52.0, 53.0, 44.0, 60.8, 22.4, 52.5, 22.1, 50.7, 53.0…
## $ lab_2015    <dbl> 18.3, 22.4, 26.7, 34.8, 11.2, 41.0, 18.4, 49.8, 15.1, 21.3…
## $ ld_2015     <dbl> 8.82, 3.37, 8.38, 2.98, 7.19, 14.83, 5.98, 2.42, 10.62, 5.…
## $ ukip_2015   <dbl> 17.87, 19.62, 8.01, 15.89, 14.44, 21.41, 18.82, 21.76, 19.…
## $ leave_share <dbl> 57.9, 67.8, 38.6, 65.3, 49.7, 70.5, 59.9, 61.8, 51.8, 50.3…
## $ born_in_uk  <dbl> 83.1, 96.1, 90.5, 97.3, 93.3, 97.0, 90.5, 90.7, 87.0, 88.8…
## $ male        <dbl> 49.9, 48.9, 48.9, 49.2, 48.0, 49.2, 48.5, 49.2, 49.5, 49.5…
## $ unemployed  <dbl> 3.64, 4.55, 3.04, 4.26, 2.47, 4.74, 3.69, 5.11, 3.39, 2.93…
## $ degree      <dbl> 13.87, 9.97, 28.60, 9.34, 18.78, 6.09, 13.12, 7.90, 17.80,…
## $ age_18to24  <dbl> 9.41, 7.33, 6.44, 7.75, 5.73, 8.21, 7.82, 8.94, 7.56, 7.61…
head(brexit)
## # A tibble: 6 × 11
##   Seat          con_2015 lab_2015 ld_2015 ukip_2015 leave_share born_in_uk  male
##   <chr>            <dbl>    <dbl>   <dbl>     <dbl>       <dbl>      <dbl> <dbl>
## 1 Aldershot         50.6     18.3    8.82     17.9         57.9       83.1  49.9
## 2 Aldridge-Bro…     52.0     22.4    3.37     19.6         67.8       96.1  48.9
## 3 Altrincham a…     53.0     26.7    8.38      8.01        38.6       90.5  48.9
## 4 Amber Valley      44.0     34.8    2.98     15.9         65.3       97.3  49.2
## 5 Arundel and …     60.8     11.2    7.19     14.4         49.7       93.3  48.0
## 6 Ashfield          22.4     41.0   14.8      21.4         70.5       97.0  49.2
## # … with 3 more variables: unemployed <dbl>, degree <dbl>, age_18to24 <dbl>
brexit <- brexit %>% 
 
select(c("con_2015", "lab_2015", "ld_2015", "ukip_2015", "leave_share"))

names(brexit) <- c("Conservative" ,"Labour", "Lib_Dems", "UKIP", "Leave_Share")


long_brexit <- pivot_longer(brexit, c("Conservative" ,"Labour", "Lib_Dems", "UKIP"), names_to="party")

ggplot(long_brexit, aes(x = value, y = Leave_Share, color=party ) ) +
  geom_smooth(size=0.7, method = "lm" ) +
  geom_point(size=0.7, alpha = 0.3) +
  labs( title = "How political affiliation translated to Brexit voting",
          x = "Part % in the UK 2015 general election",
          y = "Leave share in the UK 2016 general election") +
  theme_light() +
  theme(legend.position = "bottom") +
  scale_colour_manual(labels =  c("Conservative" ,"Labour", "Lib_Dems", "UKIP", "Leave_Share") ,
                      values = c("#0087DC","#E4003B", "#FAA61A", "#EFE600"))

#Challenge 2: CDC COVID-19 Public Use Data

The CDC Covid-19 Case Surveillance Data is a case surveillance public use dataset with 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors. You can see the variables from

There are well over 28 million entries of individual, and we will work with SQLlite database, rather than a CSV file. I would like you to produce two graphs that show death % rate:

  1. by age group, sex, and whether the patient had co-morbidities or not
  2. by age group, sex, and whether the patient was admited to Intensive Care Unit (ICU) or not.

To do this, you will have to think what dplyr verbs to use to select, filter, group_by, etc. You will then use the example shown in https://mam2022.netlify.app/reference/reference_sql/#establish-a-connection-with-the-sqlite-database-1 to use dplyr, dbplyr, and ggplot to produce these graphs.

5 Challenge 2:GDP components over time and among countries

At the risk of oversimplifying things, the main components of gross domestic product, GDP are personal consumption (C), business investment (I), government spending (G) and net exports (exports - imports). You can read more about GDP and the different approaches in calculating at the Wikipedia GDP page.

The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP. The file we will work with is GDP and its breakdown at constant 2010 prices in US Dollars and it has already been saved in the Data directory. Have a look at the Excel file to see how it is structured and organised

UN_GDP_data  <-  read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip

The first thing you need to do is to tidy the data, as it is in wide format and you must make it into long, tidy format. Please express all figures in billions (divide values by 1e9, or \(10^9\)), and you want to rename the indicators into something shorter.

make sure you remove eval=FALSE from the next chunk of R code– I have it there so I could knit the document

tidy_GDP_data  <-  UN_GDP_data %>% 
  pivot_longer(cols = 4:51,
               names_to = "year") %>% 
  mutate(value=value/1e9) 

glimpse(tidy_GDP_data)
## Rows: 176,880
## Columns: 5
## $ CountryID     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ Country       <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
## $ IndicatorName <chr> "Final consumption expenditure", "Final consumption expe…
## $ year          <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976", …
## $ value         <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6.…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")

First, can you produce this plot?

skimr::skim(tidy_GDP_data)
Data summary
Name tidy_GDP_data
Number of rows 176880
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1 4 34 0 220 0
IndicatorName 0 1 17 88 0 17 0
year 0 1 4 4 0 48 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CountryID 0 1.00 439.2 254 4 214.00 440.0 660.0 894 ▇▇▇▇▆
value 15421 0.91 72.2 447 -568 0.36 2.5 17.9 17349 ▇▁▁▁▁
unique(tidy_GDP_data$IndicatorName)
##  [1] "Final consumption expenditure"                                                           
##  [2] "Household consumption expenditure (including Non-profit institutions serving households)"
##  [3] "General government final consumption expenditure"                                        
##  [4] "Gross capital formation"                                                                 
##  [5] "Gross fixed capital formation (including Acquisitions less disposals of valuables)"      
##  [6] "Exports of goods and services"                                                           
##  [7] "Imports of goods and services"                                                           
##  [8] "Gross Domestic Product (GDP)"                                                            
##  [9] "Agriculture, hunting, forestry, fishing (ISIC A-B)"                                      
## [10] "Mining, Manufacturing, Utilities (ISIC C-E)"                                             
## [11] "Manufacturing (ISIC D)"                                                                  
## [12] "Construction (ISIC F)"                                                                   
## [13] "Wholesale, retail trade, restaurants and hotels (ISIC G-H)"                              
## [14] "Transport, storage and communication (ISIC I)"                                           
## [15] "Other Activities (ISIC J-P)"                                                             
## [16] "Total Value Added"                                                                       
## [17] "Changes in inventories"
indicator_list <- c("Gross capital formation",
                    "Exports of goods and services",
                    "Imports of goods and services",
                    "Household consumption expenditure (including Non-profit institutions serving households)",
                    "General government final consumption expenditure")
library(scales)

country_list_gdp <- tidy_GDP_data%>%
  filter(Country %in% country_list,IndicatorName %in% indicator_list )

ggplot(country_list_gdp,aes(x = as.numeric(year), group = IndicatorName))+
  geom_line(aes(y = value, color = IndicatorName))+
  facet_wrap(~Country)+
  theme_bw()+
  theme(legend.position = "right")+
  scale_x_continuous(breaks = scales::pretty_breaks(n=5))+
  labs(title="GDP components over time",
       subtitle="In constant 2010 USD",
       x="",
       y="Billion US$")+
  scale_color_discrete(labels = c("Gross capital formation",
                                "Exports",
                                "Goverment Expenditure",
                                "Household expenditure",
                                "Imports")
      )

Secondly, recall that GDP is the sum of Household Expenditure (Consumption C), Gross Capital Formation (business investment I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is an indicator Gross Domestic Product (GDP) in your dataframe, I would like you to calculate it given its components discussed above.

What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe?

tidy_gdp_wider <- tidy_GDP_data%>%
  pivot_wider(names_from = IndicatorName, values_from = value)
  
tidy_gdp_wider <- tidy_gdp_wider%>%
  mutate(net_export = tidy_gdp_wider[[indicator_list[2]]] - tidy_gdp_wider[[indicator_list[3]]])

tidy_gdp_wider <- tidy_gdp_wider%>%
  mutate(calc_gdp = tidy_gdp_wider[[indicator_list[1]]]+ 
                        tidy_gdp_wider[[indicator_list[4]]]+
                        tidy_gdp_wider[[indicator_list[5]]]+
                        tidy_gdp_wider[["net_export"]],
         prc_of_gdp_gross_capital = tidy_gdp_wider[[indicator_list[1]]]/calc_gdp * 100,
         prc_of_gdp_house = tidy_gdp_wider[[indicator_list[4]]]/calc_gdp * 100,
         prc_of_gdp_gov = tidy_gdp_wider[[indicator_list[5]]]/calc_gdp * 100,
         prc_of_gdp_net_exp = tidy_gdp_wider[["net_export"]]/calc_gdp * 100)



indicator_list_2 <- c("Gross capital formation",
                      "Household consumption expenditure (including Non-profit institutions serving households)",
                      "General government final consumption expenditure",
                      "net_export")

calc_gdp_longer <- tidy_gdp_wider%>%
  pivot_longer(cols = 22:26 , names_to = "IndicatorName")

calc_gdp_longer%>%
  filter(Country %in% country_list)%>%
  filter(IndicatorName %in% c("prc_of_gdp_gross_capital","prc_of_gdp_house","prc_of_gdp_gov","prc_of_gdp_net_exp"))%>%
ggplot(aes(x = as.numeric(year), group = IndicatorName))+
  geom_line(aes(y = value, color = IndicatorName))+
  facet_wrap(~Country)+
  theme_bw()+
  theme(legend.position = "right")+
  scale_color_discrete(labels = c("Goverment Expenditure",
                                "Gross Capital Formation",
                                "Gross Household Expenditure",
                                "Net Export")
      )+
  scale_x_continuous(breaks = scales::pretty_breaks(n=5))+
  labs(title="GDP and its breakdown at constant 2010 prices in US Dollars",
       x="",
       y="Billion US$")+
  scale_y_continuous(labels = function(y) sprintf("%.1f %%",y))

tidy_gdp_wider<- tidy_gdp_wider%>%
  mutate(calc_vs_real_gdp = calc_gdp / tidy_gdp_wider[["Gross Domestic Product (GDP)"]]-1 )

What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe? TEAM ANSWER:

The percentage difference changes from year to year depending on how much the factors within the table that we did not use in our calculation contributed to the total GDP.

What is this last chart telling you? Can you explain in a couple of paragraphs the different dynamic among these three countries? TEAM ANSWER:

All three countries have highest proportion of GDP in Household Expenditure and lowest proportion of GDP in Net Exports.

India has the highest household consumption proportion compared to Germany and the United States, yet it is decreasing over the years. In contrast, gross capital formation proportion increased significantly over the years. This might be due to India being a developing country where incomes flows towards investment capital consumption from household consumption capital.

Germany and the United states showcase similar pattern of all proportion being relatively stable over the years. Both countries are developed countries with stable development, which explains the relatively less volatile proportion changes.

If you want to, please change country_list <- c("United States","India", "Germany") to include your own country and compare it with any two other countries you like

6 Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

7 Details

  • Who did you collaborate with: Members of Study group 11
  • Approximately how much time did you spend on this problem set: 8
  • What, if anything, gave you the most trouble: ANSWER HERE

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.